Tutorial 0: Data Preparation¶
Welcome to the Cell2Sentence (C2S) tutorial series! In this first tutorial, we will guide you through the process of preparing and preprocessing a single-cell RNA sequencing dataset for use with C2S. This step is important because the processed data produced in this notebook will be used in all subsequent tutorials, so make sure to complete this notebook first before moving on to the others if you are using the example data.
If you are processing your own dataset, keep a few things in mind:
- This notebook assumes you are starting from a single-cell dataset with raw transcript counts. You can follow these steps to preprocess your dataset for use with C2S.
- If you are starting from pre-normalized data, keep in mind that C2S normalization consists of the standard filtering & count normalization steps outlined by Scanpy, with the exception that the log1p transformation is applied with base 10. For best results with the inverse reconstruction (going from cell sentences back to expression, shown in tutorial 1), make sure to use base-10 normalized data as shown in this tutorial.
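The forward and inverse transforms described above can be sketched in a few lines. This is a minimal NumPy-only sketch; `log1p_base10` and `inverse_log1p_base10` are illustrative names, not C2S API functions:

```python
import numpy as np

def log1p_base10(counts: np.ndarray) -> np.ndarray:
    # Forward transform used by C2S normalization: log10(1 + x)
    return np.log10(1.0 + counts)

def inverse_log1p_base10(values: np.ndarray) -> np.ndarray:
    # Inverse transform: recover normalized counts from log space
    return np.power(10.0, values) - 1.0

counts = np.array([0.0, 1.0, 9.0, 99.0])
transformed = log1p_base10(counts)            # [0., ~0.301, 1., 2.]
recovered = inverse_log1p_base10(transformed)  # round-trips back to counts
```

Because the inverse reconstruction in tutorial 1 assumes this base-10 convention, applying natural-log-normalized data instead would systematically distort the recovered expression values.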
We will be working with a subset of data from the Immune System tissue dataset by Domínguez Conde et al. This dataset provides an excellent foundation for demonstrating how to convert raw single-cell data into a format that can be used with C2S models.
- Dataset Information:
- Source: Immune System tissue dataset from Domínguez Conde et al. (2022)
- Link: Science Article
- Citation: Domínguez Conde, C., et al. "Cross-tissue immune cell analysis reveals tissue-specific features in humans." Science 376.6594 (2022): eabl5197.
- Subset Used: We use data from two donors (A29 and A31), totaling 29,773 cells. This subset is chosen to keep the tutorial data manageable.
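If you are starting from the full dataset rather than our pre-subset file, the donor subsetting can be sketched as below. This is a minimal sketch using a plain pandas DataFrame standing in for `adata.obs`; on a real AnnData object the boolean mask is applied with `adata[mask].copy()`:

```python
import pandas as pd

# Synthetic stand-in for adata.obs of the full dataset; the donor column
# name ("batch_condition") matches the one used in this tutorial's subset.
obs = pd.DataFrame({"batch_condition": ["A29", "A29", "A31", "A36", "A31"]})

# Boolean mask selecting the two tutorial donors. On an AnnData object:
#   adata_subset = adata[mask].copy()
mask = obs["batch_condition"].isin(["A29", "A31"])
n_kept = int(mask.sum())  # 4 of the 5 toy cells are kept
```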
First, we will import the necessary libraries. These include general-purpose Python libraries, as well as specialized libraries for single-cell RNA sequencing data analysis.
from __future__ import annotations  # postponed evaluation of type annotations
import os
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
import scanpy as sc
import anndata as ad
from collections import Counter  # for quick frequency tables
sc.set_figure_params(dpi=300, color_map="viridis_r", facecolor="white")
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
| Package | Version |
|---|---|
| numpy | 2.3.3 |
| pandas | 2.3.3 |
| matplotlib | 3.10.6 |
| seaborn | 0.13.2 |
| scanpy | 1.11.4 |
| anndata | 0.12.2 |
| Component | Info |
|---|---|
| Python | 3.11.13 \| packaged by conda-forge \| (main, Jun 4 2025, 14:48:23) [GCC 13.3.0] |
| OS | Linux-5.14.0-503.22.1.el9_5.x86_64-x86_64-with-glibc2.34 |
| CPU | 64 logical CPU cores, x86_64 |
| GPU | No GPU found |
| Updated | 2025-11-11 15:42 |
Dependencies
| Dependency | Version |
|---|---|
| ipython | 9.6.0 |
| defusedxml | 0.7.1 |
| python-dateutil | 2.9.0.post0 |
| prompt_toolkit | 3.0.52 |
| donfig | 0.8.1.post1 |
| packaging | 25.0 |
| matplotlib-inline | 0.1.7 |
| pyzmq | 27.1.0 |
| pillow | 11.3.0 |
| texttable | 1.7.0 |
| comm | 0.2.3 |
| debugpy | 1.8.17 |
| session-info2 | 0.2.2 |
| scikit-learn | 1.7.2 |
| pyparsing | 3.2.5 |
| Pygments | 2.19.2 |
| numba | 0.62.1 |
| kiwisolver | 1.4.9 |
| typing_extensions | 4.15.0 |
| msgpack | 1.1.1 |
| six | 1.17.0 |
| psutil | 7.1.0 |
| jupyter_client | 8.6.3 |
| asttokens | 3.0.0 |
| pytz | 2025.2 |
| tornado | 6.5.2 |
| jedi | 0.19.2 |
| cycler | 0.12.1 |
| threadpoolctl | 3.6.0 |
| stack_data | 0.6.3 |
| scipy | 1.16.2 |
| legacy-api-wrap | 1.4.1 |
| PyYAML | 6.0.3 |
| natsort | 8.4.0 |
| executing | 2.2.1 |
| h5py | 3.14.0 |
| patsy | 1.0.1 |
| joblib | 1.5.2 |
| pure_eval | 0.2.3 |
| traitlets | 5.14.3 |
| parso | 0.8.5 |
| colorama | 0.4.6 |
| zarr | 3.1.3 |
| jupyter_core | 5.8.1 |
| numcodecs | 0.16.1 |
| platformdirs | 4.4.0 |
| setuptools | 78.1.1 |
| statsmodels | 0.14.5 |
| llvmlite | 0.45.1 |
| leidenalg | 0.10.2 |
| igraph | 0.11.9 |
| charset-normalizer | 3.4.3 |
| crc32c | 2.7.1 |
| ipykernel | 6.30.1 |
| ipywidgets | 8.1.5 |
| decorator | 5.2.1 |
| wcwidth | 0.2.14 |
A random seed was already set in the imports cell above. The seed controls any random operations in the code, ensuring that they produce the same output each time you run the notebook; the lines below are kept commented out for reference.
# SEED = 1234
# random.seed(SEED)
# np.random.seed(SEED)
Load Data¶
Next, we will load the dataset. The dataset is stored in the AnnData format, which is commonly used for single-cell RNA sequencing data (scRNAseq). This format efficiently handles large datasets and provides useful functions for analysis.
We will use a dataset containing two donor samples from the Domínguez Conde et al. study. Make sure you have downloaded this file (H5AD format) from our Google Drive link and set the correct file path below before running the following code.
DATA_PATH = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_conde_immune_tissue_two_donors.h5ad"
# read the data
adata = ad.read_h5ad(DATA_PATH)
adata
AnnData object with n_obs × n_vars = 29773 × 36503
obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex'
var: 'gene_name', 'ensembl_id'
print(Counter(adata.obs["batch_condition"]))
# Check the data structure
# adata.X # main data matrix
print(adata.X.shape)
display(adata.obs.head()) # observation meta-data (cells)
display(adata.var.head()) # variable meta-data (genes)
Counter({'A29': 17327, 'A31': 12446})
(29773, 36503)
| | cell_type | tissue | batch_condition | organism | assay | sex |
|---|---|---|---|---|---|---|
| Pan_T7935490_AAACCTGCAAATTGCC | CD4-positive helper T cell | ileum | A29 | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAACGGGCATCTGGTA | CD8-positive, alpha-beta memory T cell | ileum | A29 | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAACGGGTCTTGCATT | CD8-positive, alpha-beta memory T cell | ileum | A29 | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAAGCAATCATCGCTC | CD8-positive, alpha-beta memory T cell | ileum | A29 | Homo sapiens | 10x 5' v1 | female |
| Pan_T7935490_AAAGTAGCAGTCACTA | gamma-delta T cell | ileum | A29 | Homo sapiens | 10x 5' v1 | female |
| | gene_name | ensembl_id |
|---|---|---|
| MIR1302-2HG | MIR1302-2HG | ENSG00000243485 |
| FAM138A | FAM138A | ENSG00000237613 |
| OR4F5 | OR4F5 | ENSG00000186092 |
| RP11-34P13 | RP11-34P13 | ENSG00000238009 |
| RP11-34P13-1 | RP11-34P13 | ENSG00000239945 |
The dataset contains ~30k cells (two donors) and ~36k genes. From the output above, we can also see the cell counts for each donor (A29: 17,327 cells; A31: 12,446 cells). The data carries multiple metadata annotations, including cell type, tissue, organism, assay, and sex; note that the assay is a 10x 5'-capture protocol (commonly used for TCR/BCR profiling). We will look through these attributes quickly, starting with the per-tissue and per-cell-type counts:
Counter(adata.obs["tissue"])
Counter({'thoracic lymph node': 7573,
'spleen': 5794,
'mesenteric lymph node': 5209,
'lung': 4458,
'liver': 2716,
'bone marrow': 1773,
'duodenum': 735,
'ileum': 379,
'thymus': 354,
'caecum': 339,
'transverse colon': 262,
'sigmoid colon': 133,
'omentum': 30,
'skeletal muscle tissue': 18})
Counter(adata.obs["cell_type"])
Counter({'naive thymus-derived CD4-positive, alpha-beta T cell': 4262,
'memory B cell': 3584,
'macrophage': 2404,
'naive B cell': 2223,
'regulatory T cell': 1582,
'T follicular helper cell': 1430,
'CD8-positive, alpha-beta memory T cell, CD45RO-positive': 1369,
'naive thymus-derived CD8-positive, alpha-beta T cell': 1335,
'mucosal invariant T cell': 1306,
'effector memory CD4-positive, alpha-beta T cell': 1249,
'effector memory CD8-positive, alpha-beta T cell, terminally differentiated': 1181,
'alveolar macrophage': 1180,
'classical monocyte': 1003,
'plasma cell': 715,
'CD8-positive, alpha-beta memory T cell': 579,
'CD4-positive helper T cell': 537,
'CD16-positive, CD56-dim natural killer cell, human': 498,
'alpha-beta T cell': 486,
'gamma-delta T cell': 481,
'lymphocyte': 424,
'CD16-negative, CD56-bright natural killer cell, human': 341,
'animal cell': 303,
'conventional dendritic cell': 251,
'non-classical monocyte': 250,
'germinal center B cell': 141,
'erythroid lineage cell': 135,
'group 3 innate lymphoid cell': 116,
'mast cell': 88,
'plasmablast': 79,
'progenitor cell': 75,
'dendritic cell, human': 58,
'plasmacytoid dendritic cell': 40,
'megakaryocyte': 39,
'pro-B cell': 17,
'precursor B cell': 12})
There are diverse immune cell types and subtypes within this data, which makes for interesting prediction and generation tasks later on!
We have both gene names and Ensembl IDs in our data. The gene names are important because they will make up the cell sentences we create later, so we will keep gene names as the adata.var_names of our AnnData object; the var_names will be used by the C2S code base functions to create the cell sentences.
adata.var
| | gene_name | ensembl_id |
|---|---|---|
| MIR1302-2HG | MIR1302-2HG | ENSG00000243485 |
| FAM138A | FAM138A | ENSG00000237613 |
| OR4F5 | OR4F5 | ENSG00000186092 |
| RP11-34P13 | RP11-34P13 | ENSG00000238009 |
| RP11-34P13-1 | RP11-34P13 | ENSG00000239945 |
| ... | ... | ... |
| ENSG00000277836 | ENSG00000277836 | ENSG00000277836 |
| ENSG00000278633 | ENSG00000278633 | ENSG00000278633 |
| ENSG00000276017 | ENSG00000276017 | ENSG00000276017 |
| ENSG00000278817 | ENSG00000278817 | ENSG00000278817 |
| ENSG00000277196 | ENSG00000277196 | ENSG00000277196 |
36503 rows × 2 columns
adata.var_keys()
adata.var_names
Index(['MIR1302-2HG', 'FAM138A', 'OR4F5', 'RP11-34P13', 'RP11-34P13-1',
'RP11-34P13-2', 'RP11-34P13-3', 'RP11-34P13-4', 'AP006222',
'RP4-669L17',
...
'ENSG00000274175', 'ENSG00000275869', 'ENSG00000273554',
'ENSG00000278782', 'ENSG00000277761', 'ENSG00000277836',
'ENSG00000278633', 'ENSG00000276017', 'ENSG00000278817',
'ENSG00000277196'],
dtype='object', length=36503)
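If your own dataset stores gene symbols in a `.var` column rather than in `var_names`, the usual pattern is `adata.var_names = adata.var["gene_name"].astype(str)` followed by `adata.var_names_make_unique()`, which suffixes duplicates so every gene name is unique. The sketch below mimics that deduplication on a plain list; the `make_unique` helper is illustrative, not part of the C2S or AnnData API:

```python
import pandas as pd

def make_unique(names):
    # Append "-1", "-2", ... to repeated names, similar in spirit to
    # anndata's var_names_make_unique()
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}-{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return pd.Index(out)

names = make_unique(["CD3D", "CD3D", "MS4A1"])  # Index(['CD3D', 'CD3D-1', 'MS4A1'])
```

Unique var_names matter here because duplicate gene symbols would be ambiguous once cells are serialized into gene-name sentences.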
adata.X
<Compressed Sparse Row sparse matrix of dtype 'float32' with 48394900 stored elements and shape (29773, 36503)>
adata.X.data.shape # is a 1D NumPy array storing all the non-zero values of the sparse matrix.
(48394900,)
adata.X.data[:10] # we can see from the count values that we have raw counts
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 6.], dtype=float32)
Looking at the transcript counts, we can see that we are starting from raw counts (integer-valued, rather than continuous normalized values).
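A quick way to confirm this is to check that all stored nonzero values are whole numbers; raw UMI counts are integral, while normalized values generally are not. A minimal sketch on a small array standing in for `adata.X.data`:

```python
import numpy as np

# Toy stand-in for the first few nonzero entries of adata.X.data
data = np.array([1.0, 1.0, 6.0, 3.0, 12.0], dtype=np.float32)

# Raw counts: every value equals its rounded self
is_raw_counts = bool(np.all(data == np.round(data)))

# After a log transform, the same check fails
normalized = np.log10(1.0 + data)
is_normalized_integer = bool(np.all(normalized == np.round(normalized)))
```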
Data preprocessing¶
Now we will preprocess and normalize the data, following standard processing pipelines (Scanpy reference tutorial: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html).
Cell2Sentence only deviates from the standard preprocessing and normalization pipeline in that the log transformation is done with a base of 10 rather than natural logarithm. We start with simple filtering steps:
adata
AnnData object with n_obs × n_vars = 29773 × 36503
obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex'
var: 'gene_name', 'ensembl_id'
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
filtered out 12559 genes that are detected in less than 3 cells
adata
AnnData object with n_obs × n_vars = 29773 × 23944
obs: 'cell_type', 'tissue', 'batch_condition', 'organism', 'assay', 'sex', 'n_genes'
var: 'gene_name', 'ensembl_id', 'n_cells'
We now can annotate mitochondrial gene counts in our cells, for additional potential filtering steps:
# annotate the group of mitochondrial genes as "mt"
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True,
)
We have the option here to filter out cells using quality-control metrics such as mitochondrial gene content, but following the guidance of the Scanpy tutorial (https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering.html#quality-control), we will stick with the lenient filtering based only on min_cells and min_genes from the previous cell. This step can be revisited if you wish to filter out more cells based on QC metrics.
# adata = adata[adata.obs.n_genes_by_counts < 6000, :]
# adata = adata[adata.obs.pct_counts_mt < 50, :].copy()
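Before applying stricter thresholds, it can help to preview how many cells they would remove. A sketch on a toy `obs` table; the column names match those produced by `sc.pp.calculate_qc_metrics`, and the thresholds are the commented-out ones above:

```python
import pandas as pd

# Toy stand-in for adata.obs with QC metric columns
obs = pd.DataFrame({
    "n_genes_by_counts": [1500, 7200, 3000, 6500],
    "pct_counts_mt": [2.0, 10.0, 65.0, 5.0],
})

# Cells passing both QC thresholds; on an AnnData: adata = adata[keep].copy()
keep = (obs["n_genes_by_counts"] < 6000) & (obs["pct_counts_mt"] < 50)
n_dropped = int((~keep).sum())  # 3 of the 4 toy cells fail at least one threshold
```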
Normalization¶
Now, we normalize our data. We follow the typical count normalization and log1p transformation steps to normalize single-cell datasets for use with C2S. The only difference is that we perform the log1p transformation with base 10, rather than the default natural logarithm. We empirically found this base 10 log1p transformation to work well for inverse reconstruction of expression using a linear model, which will be shown in tutorial notebook 1.
# Count normalization
sc.pp.normalize_total(adata)
# Log1p transformation with base 10 - base 10 is important for C2S transformation!
sc.pp.log1p(adata, base=10)
normalizing counts per cell
finished (0:00:05)
adata.X.max()
np.float32(3.408124)
With the base-10 log transformation applied, the largest value in our count matrix is now ~3.4, as expected for log10-scaled normalized counts.
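To sanity-check that number, we can invert the base-10 log1p transform: a log-space value of ~3.4 corresponds to a normalized count of roughly 10**3.4 - 1, on the order of 2.6 × 10³, for the most highly expressed gene in any cell:

```python
# Invert the base-10 log1p transform on the observed log-space maximum
log_max = 3.408124            # value reported by adata.X.max() above
count_max = 10 ** log_max - 1.0  # normalized count before the log transform
```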
Visualization¶
Now let's process our adata object for UMAP visualization of our dataset. We will run PCA, Scanpy's neighbors algorithm, and then the UMAP algorithm implemented by Scanpy.
sc.tl.pca(adata)
computing PCA
with n_comps=50
finished (0:00:42)
sc.pp.neighbors(adata)
computing neighbors
using 'X_pca' with n_pcs = 50
finished: added to `.uns['neighbors']`
`.obsp['distances']`, distances for each pair of neighbors
`.obsp['connectivities']`, weighted adjacency matrix (0:01:39)
sc.tl.umap(adata)
computing UMAP
finished: added
'X_umap', UMAP coordinates (adata.obsm)
'umap', UMAP parameters (adata.uns) (0:00:22)
We can visualize the UMAPs of the processed data colored by donor, cell type, and tissue metadata
sc.pl.umap(
adata,
color="batch_condition",
size=8,
title="Human Immune Tissue UMAP",
)
sc.pl.umap(
adata,
color="cell_type",
size=8,
title="Human Immune Tissue UMAP",
)
sc.pl.umap(
adata,
color="tissue",
size=8,
title="Human Immune Tissue UMAP",
)
Finally, we will save the dataset to disk. This processed dataset will be used in all subsequent tutorials, so keep this file path handy.
SAVE_PATH = "/ix/ccdg/storage3/til177/ParkLab/Project_C2S_scFM/code/tutorials/dominguez_conde_immune_tissue_two_donors_preprocessed_tutorial_0.h5ad"
adata.write_h5ad(SAVE_PATH)